# A tibble: 10 × 5
# Groups: exerany [2]
exerany health n mean stdev
<int> <fct> <int> <dbl> <dbl>
1 0 E 18 36.9 4.70
2 0 VG 54 38.6 7.50
3 0 G 58 34.9 8.51
4 0 F 31 30.7 8.49
5 0 P 8 29.7 7.24
6 1 E 92 39.9 6.50
7 1 VG 191 38.5 6.87
8 1 G 152 35.8 7.20
9 1 F 49 38.2 7.63
10 1 P 17 37.9 10.4
Code for Interaction Plot
ggplot(summaries_1, aes(x = health, y = mean, col = factor(exerany))) +
  geom_point(size = 2) +
  geom_line(aes(group = factor(exerany))) +
  scale_color_viridis_d(option = "C", end = 0.5) +
  labs(title = "Observed Means of 1000/BMI",
       subtitle = "by Exercise and Overall Health")
Note the use of factor here since the exerany variable is in fact numeric, although it only takes the values 1 and 0.
Sometimes it’s helpful to treat 1/0 as a factor, and sometimes not.
Where is the evidence of serious non-parallelism (if any) in the plot on the next slide that results from this code?
Resulting Interaction Plot
Fitting a Two-Way ANOVA model for 1000/BMI
Model m1 without interaction
m1 <- lm(1000/bmi ~ exerany + health, data = train_c4im)
This looks very different from our previous version of the model. What happens when we make a prediction, though?
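To see what the interaction term adds, here is a minimal sketch comparing the main-effects model to one with an `exerany * health` interaction. Since `train_c4im` isn't available here, simulated data (and the outcome name `y`) stand in for the course's data; the model structure is the point, not the numbers.

```r
# Sketch (simulated data stands in for train_c4im): two-way ANOVA
# without and with an exerany-by-health interaction.
set.seed(431)
sim <- data.frame(
  exerany = rbinom(300, 1, 0.6),
  health  = factor(sample(c("E", "VG", "G", "F", "P"), 300, replace = TRUE))
)
sim$y <- 36 + 2 * sim$exerany + as.numeric(sim$health) + rnorm(300, 0, 5)

fit_main <- lm(y ~ exerany + health, data = sim)  # main effects only
fit_int  <- lm(y ~ exerany * health, data = sim)  # adds interaction terms

# The interaction model adds one coefficient per non-baseline health level,
# letting the exercise effect differ across health categories.
length(coef(fit_int)) - length(coef(fit_main))  # 4 extra terms here
```

With five `health` levels, the interaction adds four coefficients, one for each non-baseline level.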
Prediction in the Orthogonal Polynomial Model
Remember that in our raw polynomial model, our “by hand” and “using R” calculations each predicted 1000/bmi for a subject with fruit_day = 2 to be 37.981.
What happens with the orthogonal polynomial model temp2?
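A quick way to check is to predict at `fruit_day = 2` from both fits. Here is a sketch with simulated data in place of `train_c4im`; the model names `temp1` and `temp2` mirror the slides, but the assumed formulas (a raw `I(fruit_day^2)` term vs. `poly(fruit_day, 2)`) are reconstructions, not the course's exact code.

```r
# Sketch: raw vs. orthogonal quadratic fits make identical predictions.
# Simulated data stands in for train_c4im.
set.seed(432)
sim <- data.frame(fruit_day = runif(200, 0, 5))
sim$y <- 35 + 2 * sim$fruit_day - 0.4 * sim$fruit_day^2 + rnorm(200, 0, 3)

temp1 <- lm(y ~ fruit_day + I(fruit_day^2), data = sim)  # raw polynomial
temp2 <- lm(y ~ poly(fruit_day, 2), data = sim)          # orthogonal polynomial

new_pt <- data.frame(fruit_day = 2)
predict(temp1, new_pt)
predict(temp2, new_pt)  # same fitted value, to machine precision
```

The coefficients differ because the two fits use different bases, but both bases span the same quadratic model space, so the fitted values agree.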
temp1_aug <- augment(temp1, train_c4im)
temp2_aug <- augment(temp2, train_c4im)

p1 <- ggplot(temp1_aug, aes(x = fruit_day, y = 1000/bmi)) +
  geom_point(alpha = 0.3) +
  geom_line(aes(x = fruit_day, y = .fitted), col = "red", size = 2) +
  labs(title = "temp1: Raw fit, degree 2")

p2 <- ggplot(temp2_aug, aes(x = fruit_day, y = 1000/bmi)) +
  geom_point(alpha = 0.3) +
  geom_line(aes(x = fruit_day, y = .fitted), col = "blue", size = 2) +
  labs(title = "temp2: Orthogonal fit, degree 2")

p1 + p2 + plot_annotation(title = "Comparing Two Methods of Fitting a Quadratic Polynomial")
The two models are, in fact, identical.
Fits of raw vs orthogonal polynomials
Why use orthogonal polynomials?
The main reason is to avoid including powers of our predictor that are highly collinear with one another.
Variance Inflation Factor assesses collinearity…
rms::vif(temp1) ## from rms package
fruit_day I(fruit_day^2)
4.652178 4.652178
Orthogonal polynomial terms are uncorrelated…
rms::vif(temp2)
poly(fruit_day, 2)1 poly(fruit_day, 2)2
1 1
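The VIF pattern above can be verified directly from correlations. A raw predictor and its square are strongly correlated, while the columns produced by `poly()` are orthogonal by construction; and for a two-predictor model, VIF is just 1 / (1 - r²). A minimal sketch (with an illustrative `x`, not the course data):

```r
# Sketch: why raw polynomial terms are collinear but poly() terms are not.
x <- seq(0.5, 5, length.out = 100)

cor(x, x^2)            # raw terms: strongly correlated
cor(poly(x, 2))[1, 2]  # orthogonal terms: correlation is essentially zero

# For two predictors, VIF = 1 / (1 - r^2); with orthogonal terms, r = 0
# and so the VIF is exactly 1, matching the rms::vif(temp2) output above.
1 / (1 - cor(x, x^2)^2)
```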
Why orthogonal polynomials?
The tradeoff is that, in the simplest cases, the raw polynomial is much easier to explain in terms of a single equation.
In practice, we'll usually avoid polynomials altogether and instead use splines, which are more flexible and require less maintenance. The cost is that splines pretty much require you to focus on visualizing their predictions rather than interpreting their equations.
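For a flavor of that approach, here is a sketch fitting a natural cubic spline with `splines::ns()`. This is an illustration under assumed names (simulated data in place of the course data), not the course's exact code; note that we summarize the fit through its predictions, not its coefficients.

```r
# Sketch: a natural cubic spline as an alternative to a polynomial.
library(splines)

set.seed(433)
sim <- data.frame(fruit_day = runif(200, 0, 5))
sim$y <- 35 + 2 * sim$fruit_day - 0.4 * sim$fruit_day^2 + rnorm(200, 0, 3)

# ns(fruit_day, df = 3) builds a 3-column natural spline basis
fit_ns <- lm(y ~ ns(fruit_day, df = 3), data = sim)

# The individual spline coefficients are hard to interpret directly,
# so we look at predictions across a grid of fruit_day values instead.
grid <- data.frame(fruit_day = seq(0, 5, by = 0.25))
head(cbind(grid, .fitted = predict(fit_ns, grid)))
```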
female is based on biological sex (1 = female, 0 = male)
exerany comes from a response to “During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” (1 = yes, 0 = no, don’t know and refused = missing)
The genhealth variable is based on “Would you say that in general your health is …” using the five specified categories (Excellent -> Poor), numbered for convenience after data collection.
Don’t know / not sure / refused treated as missing.
Might want to run a sanity check here, just to be sure…
Checking health vs. genhealth in c4
c4 |> tabyl(genhealth, health) |> adorn_title()
health
genhealth E VG G F P NA_
1_Excellent 148 0 0 0 0 0
2_VeryGood 0 324 0 0 0 0
3_Good 0 0 274 0 0 0
4_Fair 0 0 0 112 0 0
5_Poor 0 0 0 0 35 0
<NA> 0 0 0 0 0 1
OK. We’ve preserved the order and we have much shorter labels. Sometimes, that’s helpful.
Multicategorical race_eth in raw c4
c4 |>count(race_eth)
# A tibble: 6 × 2
race_eth n
<fct> <int>
1 Black non-Hispanic 167
2 Hispanic 27
3 Multiracial non-Hispanic 19
4 Other race non-Hispanic 22
5 White non-Hispanic 646
6 <NA> 13
“Don’t know”, “Not sure”, and “Refused” were treated as missing.
What is this variable actually about?
What is the most common thing people do here?
What is the question you are asking?
Collapsing race_eth levels might be rational for some questions.
We have lots of data from two of these categories (White non-Hispanic and Black non-Hispanic), but only those two.
Systemic racism affects people of color in different ways across these categories, but also within them.
Is combining race and Hispanic/Latinx ethnicity helpful?
It’s hard to see the justice in collecting this information and not using it in as granular a form as possible, though this leaves some small sample sizes. There is no magic number for “too small a sample size.”
Most people identified themselves in one of the categories.
These data are not ordered, and (I’d argue) ordering them isn’t helpful.
Regression models are easier to interpret, though, if the “baseline” category is a common one.
Resorting the factor for race_eth
Let’s sort all five levels, from most observations to least…
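One way to do this is with `forcats::fct_infreq()`, which reorders a factor's levels by descending frequency. A sketch, using a simulated factor built from the counts shown on the earlier slide in place of `c4$race_eth` (the actual resorting code on the next slide may differ):

```r
# Sketch: reorder a factor's levels from most to least frequent.
library(forcats)

# Simulated stand-in for c4$race_eth, using the counts from the slide above
race_eth <- factor(c(rep("White non-Hispanic", 646),
                     rep("Black non-Hispanic", 167),
                     rep("Hispanic", 27),
                     rep("Other race non-Hispanic", 22),
                     rep("Multiracial non-Hispanic", 19)))

race_eth <- fct_infreq(race_eth)
levels(race_eth)  # "White non-Hispanic" (the most common level) now comes first
```

With the most common category first, it becomes the baseline in regression models, which (as noted above) makes the coefficients easier to interpret.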